-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: add after_option
argument to ak.zip
#1308
Conversation
This broadcasts any option types at the depth limit, and forms the RecordArray layout node below the result. Effectively, this produces `?(int64, float64)` instead of `(?int64, ?float64)`
Codecov Report
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all is great, except maybe for the name. after_option
implies a temporal sequence, which makes sense to people who know about how it's implemented—that the control flow starts at the root of the tree and walks downward, so "after" means "deep" or "nested". What this is doing is sharing the option-type among the merged arrays, so maybe "shared"?
How about this very explicit name: optiontype_outside_record
? This function always creates a record/tuple at some level, so we know which record it's talking about. Including the "type" in "optiontype" disambiguates other possible uses of "option", which can also suggest user choices, nothing to do with a type in which some values can be null. So far, this is my best suggestion.
I'm leaving this open so that you can pick another name, but once you've done that, you can squash-and-merge it. The implementation is great.
I like the explicitness of |
Long argument and function names is a cultural thing, maybe coming from languages that don't have modularization and namespaces in class types. In particular, I'm thinking about function names in Emacs Lisp |
I'm generally in favour of this - I prefer descriptive names given that we mainly read code rather than write it. Exploratory code is perhaps an exception to this rule but I can also think of cases where short parameters are just meaningless without reading the docs. 🤷 |
This empty file in the tests directory was probably unintended, right? |
Er, I didn't see that in the commit; I'm not sure how I managed to do that! Would you prefer a revert or separate commit? |
No reason to do that! I'll just remove it in #1312. It can go under the category of "general cleanup." |
@jpivarski digging this up again, I realised that I didn't handle the case of I think the simplest solution here is to replace if optiontype_outside_record and any(
isinstance(x, ak._util.optiontypes) for x in inputs
):
return None with if optiontype_outside_record and any(
isinstance(x, ak._util.optiontypes + ak._util.indexedtypes) for x in inputs
):
return None But, this makes me wonder if there's more to do here - what about Union types? Does this warrant another flag if optiontype_outside_record and any(
isinstance(x, ak._util.optiontypes + ak._util.indexedtypes) for x in inputs
):
return None
if uniontype_outside_record and any(
isinstance(x, ak._util.uniontype + ak._util.indexedtypes) for x in inputs
):
return None In other words, we treat Perhaps the most explicit behaviour is if optiontype_outside_record and not all(
isinstance(x, ak._util.listtypes + (ak.layout.NumpyArray,)) for x in inputs
):
return None But this aggregates both options under a single flag, and also would mean reverting this PR. Thoughts? |
Furthermore, what is our policy on non-functional unions? E.g. In counterpoint to this, we do show them in the type string, e.g. >>> array = ak.Array(
... ak.layout.UnionArray8_64(
... ak.layout.Index8(np.r_[0, 0, 1, 1]),
... ak.layout.Index64(np.r_[0, 1, 0, 1]),
... [
... ak.layout.NumpyArray(np.r_[0,1,2,3]),
... ak.layout.NumpyArray(np.r_[0,1,2,3]*2),
... ]
...
... )
... )
>>> array.type
4 * union[int64, int64] So I think this is really a policy question |
IndexedArray and IndexedOptionArray shouldn't be nested directly within each other. If you're encountering that, then there's a missing Do you mean IndexedArray → ListArray → IndexedOptionArray or similar? Or do you mean that IndexedArrays should be considered equivalent to IndexedOptionArray when deciding whether to Why not decide whether to |
We don't need a lot of support for manipulating unions, only as much as users ask for. They turn out to be very limited in how they can be used. It might have been a reasonable choice at the beginning of Awkward Array to have ignored them entirely. Unions and records are the two types that can have lots of children. Thinking along these lines would probably make the combinatorics explode. Let's just leave the unions-in- |
Unions whose children are mergeable are not fully valid, in the sense that our functions should not be returning such things to users. If a function is returning such a thing, then it's missing a >>> array.layout
<UnionArray8_64>
<tags><Index8 i="[0 0 1 1]" offset="0" length="4" at="0x55d886a12130"/></tags>
<index><Index64 i="[0 1 0 1]" offset="0" length="4" at="0x55d887155a80"/></index>
<content tag="0">
<NumpyArray format="l" shape="4" data="0 1 2 3" at="0x55d886f9d370"/>
</content>
<content tag="1">
<NumpyArray format="l" shape="4" data="0 2 4 6" at="0x55d886dc6f30"/>
</content>
</UnionArray8_64>
>>> array.layout.simplify()
<NumpyArray format="l" shape="4" data="0 1 0 2" at="0x55d886f8c870"/> After flip-flopping around on this question in the early days, I've decided it's better to leave expressions like Why allow unions whose children are mergeable to exist at all? They're needed as intermediate states in calculations. If the UnionArray constructor raised an error when it sees them, there are some operations that would be difficult or impossible to implement. The same is true of |
Is the IndexedArray change significant enough to stop the 1.8.0 release? If this were going into v2, I would say, "no," but it's affecting v1, which people are using. Does it introduce the wrong behavior, which we would then have to backtrack? |
Thanks for the detailed replies!
I think this gets to the crux of the issue - do we want to support "quirky" user-crafted layouts? I can construct If we do want to support "quirky" layouts, then yes, we could also
Yes, agreed. That's too brittle! And, for posterity, I haven't proposed it :)
w.r.t the need to test |
Well, if the user is constructing things with We're not supporting quirky layouts in the sense of doing reasonable things if we encounter them. This is also related to the "inside Awkward/outside Awkward" boundary in #1327. It somehow needs to be a less permeable boundary than there is between any other function calls, though I don't think any language has such a concept built-in. In addition to highlighting that edge in the stack trace, we'd also like to allow "semi-invalid" IndexedArray states inside and not outside. Maybe that's something the Maybe instead, layouts could note whether they're "semi-invalid" when they're constructed—just an extra boolean for each node—and Actually, the closest to this concept of "temporarily invalid" that I've seen is in Clojure, which is a strictly functional language that allows temporarily mutable data within the scope of a function.
Okay, we do assume (not just here) that there are no quirky/semi-invalid layouts. So I'll continue with the 1.8.0 release. Virtual arrays are another thing. You know, Awkward 0.x had more node types that caused even more troubles than these. It's a slow whittling process of finding a set that play nicely with each other. |
I can only imagine.
Does this mean that we just ignore virtual arrays here? The user will only notice this is they pass this new flag and all option-containing arrays are virtual. |
Yes, it sounds like we're discussing safe and unsafe layouts which, for the most part, corresponds to "inside" or "outside" awkward. I think something like this would be quite useful actually. Given that layouts are immutable, we could just use a flag to determine whether the layout was generated by Awkward or by a user. In the latter case, we would need to validate the structure (like packing, but not so concerned about the packing part) whenever Awkward touches such a layout. Is this what you're thinking of? |
Ignoring them probably means they'll be materialized, right? That was always the problem with VirtualArrays, that the least thing would cause them to materialize unintentionally. This would only come up if someone uses the new
Right: the value of the flag at the top node tells you whether there are any interior nodes that are not correctly constructed, regardless of whether it was made by us (a function returns without calling Every option-type and IndexedArray constructor would contain: self._quirky = content._quirky or content.is_OptionType or content.is_IndexedType and the I'm not saying that this is something that we should do. As I said, there are other ways to construct invalid arrays by filling Indexes with bad data, and there is no way to check these that isn't O(N) in the length of the array (big N). The Yes, it does: here's The last test checks for this condition. |
Only if one of inputs is an option. Otherwise, no inputs are an option, and so the record is built with the current set of inputs as contents. RE quirkiness (forgive me for a terrible name, I'm going to call it "trust" from now on, though it's really just a question of does 'simplify' return the same layout), my motivation is simplifying layout visiting code (I.e most of awkward). If we don't need to handle cases like this and the recent nested option type Indeed this is something we can compute on the fly, but given that most users should never encounter such layouts (awkward should never produce them itself), it makes sense to keep trust around rather than recomputing. It's out of scope for this PR of course, but are you comfortable with imposing such constraints on layouts (I.e you can't just nest any content inside any other)? |
Most of this is outside the scope of the PR. We can already assume that option-types and IndexedArrays are not directly nested within each other. It is the responsibility of operations to return valid layouts, not to check incoming layouts for validity, because the latter (defensive programming) would require the same checks to be performed repeatedly. Reading what you and I wrote on #1193, I'm not sure that I recognized that the error only occurred when If we're going to check it at all, the internal booleans is the least expensive in time. The flag should be completely transient, not taken as an argument (as nplike is), exposed as a property, pickled, etc. because it can always be recreated. It's like a kind of cache. I think "trust" is also an unclear word for describing what it is—how about The equivalent |
This broadcasts any option types at the depth limit, and forms the RecordArray layout node below the result.
Effectively, this produces
?(int64, float64)
instead of(?int64, ?float64)
This feature is particularly useful when rezipping
RecordArray
s with?{...}
types. Without it, afterzip
ping the contents of aRecordArray
back together, we would end up with{?..., ?...}
types.P.S. I'm not sure on the naming of the argument to
ak.zip
. Feel free to suggest something clearer / shorter :)